Computerized device with voice command input capability

ABSTRACT

A computerized device with voice command capability processed remotely includes a low power processor, executing a loose algorithmic model to recognize a wake word prefix in a voice command, the loose model having a low false rejection rate but suffering a high false acceptance rate, and a second processor which can operate in at least a low power/low clock rate mode and a high power/high clock rate mode. When the first processor determines the presence of the wake word, it causes the second processor to switch to the high power/high clock rate mode and to execute a tight algorithmic model to verify the presence of the wake word. By using the two processors in this manner, the average overall power required by the computerized device is reduced, as is the amount of waste heat generated by the system.

FIELD OF THE INVENTION

The present invention relates to computerized devices. More specifically, the present invention relates to computerized devices such as smartphones, HVAC controllers, light switches, power outlets, garage door opener controllers, remote sensors and the like which include the ability to act as voice command inputs to a voice command recognition system.

BACKGROUND OF THE INVENTION

Recently, access to relatively sophisticated remote processing systems has become available through data networks such as the Internet. Such so called “cloud-based” processing services can provide the results of sophisticated and/or computationally complex processes to computerized devices which would otherwise not be able to implement such services.

An interesting example of such a capability is voice recognition, which, by employing analytic models with high levels of computational complexity, can provide very good recognition rates for spoken commands and phrases. The SIRI voice assistant implemented by Apple and the ALEXA Voice Service provided by Amazon are two examples of voice recognition systems which employ cloud-based processing centers to achieve their voice recognition capabilities.

To use such a system, a user will say a predefined word or phrase, referred to herein as a “wake word”, followed by a spoken command in the presence of a voice command input device. In such systems, the voice command input device (an Amazon ECHO, etc. for ALEXA and an iPhone, etc. for SIRI) continually captures and monitors an audio stream picked up via one or more microphones on the device. The voice command input device listens for the predefined “wake word” to be spoken by a user within audio pick up range of the device, followed by a command. An example of a valid command to such a system could be, “ALEXA, what is the time?”, where “ALEXA” is the wake word.

The captured audio stream is processed by the voice command input device to detect when/if the wake word has been spoken by a user. When such a positive determination is made, the voice command input device connects to the associated cloud-based processing service and streams the audio captured by the voice command input device to that processing service (i.e., in the case of the Echo, to the Amazon Voice Service).

The processing service analyzes the received audio stream to verify the presence of the wake word and to determine the command, if any, spoken by the user. The processing service then determines the appropriate response and sends that response back to the device (i.e., a voice message such as, “It is now 3:04 PM”) or to another system or device as appropriate. The range of possible responses is not limited and can include voice and/or music audio streams, data, commands recognized by other connected devices such as lighting controls, etc.

The use of a cloud-based processing service is preferred for such systems as the computational complexity to appropriately analyze the received audio to determine content and meaning is very high, and is presently best implemented in special purpose hardware such as GPU or FPGA based processing engines. Such hardware is too expensive, physically too large and/or has power requirements that exceed that available in many computerized devices, especially those powered by batteries, and thus cannot be included in many computerized devices such as smartphones, HVAC controllers, light switches, etc.

Therefore, the ability to provide voice command capabilities to control computerized devices, especially those such as computerized light switches or power outlets and other so called Internet of Things (“IoT”) devices, is very desirable, as many such computerized devices cannot reasonably or economically be provided with hardware such as keyboards, touchscreens or the like to otherwise allow control of the devices.

However, the computational requirements for a computerized device to reliably interact with a cloud-based processing service are not easily met by many computerized devices, hence the current need for voice command input devices such as the Echo device and/or Google's Home device. In particular, the voice recognition models which are required to be executed by the voice command input device to capture and recognize the wake word require processors with high computational capabilities to be employed in the voice command input device. Such high powered processors generate significant amounts of waste heat when operating, which can be a problem in many other devices, such as IoT devices, or consumer worn or carried devices or HVAC controllers and remote sensors. Further, such high powered processors generally require a significant amount of electrical power, which can be a problem for battery powered, or parasitically powered devices (i.e., computerized devices which obtain their operating power from the control signals of the devices they are controlling). Additionally, the time required to process the wake word, prior to relaying the voice command to the cloud service, adds to the voice system's overall latency, decreasing user satisfaction.

Unfortunately, the cost of special purpose voice command input devices such as the Echo and/or Home can slow the adoption and use of such services. It is desired to have a system and method of providing a computerized device with voice command input capability in a reliable, economical and cost effective manner.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a novel system and method for providing computerized devices with voice command capability processed remotely and which obviates or mitigates at least one disadvantage of the prior art.

According to a first aspect of the present invention, there is provided a method of recognizing a wake word for voice commands to a computerized device, comprising the steps of: (i) receiving at the computerized device an audio signal from at least one microphone; processing the received audio signal with a first processor in the computerized device, the first processor placing a copy of the processed received audio signal into a circular buffer of a preselected size and the first processor executing a first voice recognition algorithmic model to detect the presence of a predefined wake word, the first voice recognition algorithmic model selected to provide a predefined relatively low level of false non-matches of the predefined wake word at the cost of a higher than predefined level of false matches of the predefined wake word; upon the first processor determining a match of the predefined wake word, the first processor providing a signal to a second processor in the computerized device, the second processor normally operating at a first rate having a first computational capacity, the signal causing the second processor to commence operating at a second rate having a second computational capacity greater than the first computational capacity, and the second processor: (a) copying the contents of the circular buffer into a second buffer; (b) copying the processed received audio signal into a third buffer; executing a second voice recognition algorithmic model on the contents of the second buffer to verify the presence of the predefined wake word, the second voice recognition algorithmic model requiring greater computational processing than the first voice recognition algorithmic model to achieve a predefined relatively low level of both false non-matches and false matches of the predefined wake word; (c) upon completion of analyzing the contents of the second buffer with the second voice recognition algorithmic model, if the second voice recognition algorithmic model determines that the predefined wake word is not present in the second buffer, returning the second processor to operate at the first rate and, if the second voice recognition algorithmic model determines that the predefined wake word is present in the second buffer, then forwarding the contents of the second buffer and the third buffer to a voice processing service located remote from the computerized device, the voice processing service operable to receive and process voice commands.

Preferably, the voice processing service executes a third voice recognition algorithmic model requiring greater computational processing than the second voice recognition algorithmic model, the voice processing service executing the third voice recognition algorithmic model on the copy of the second buffer received at the voice processing service to verify the presence of the wake word therein and, if the third voice recognition algorithmic model does not verify the presence of the wake word, then the voice processing service sending a message to the computerized device indicating that the wake word was not present and the second processor returning to operating at the first rate and, if the third voice recognition algorithmic model does verify the presence of the wake word, then the voice processing service processing the contents of the third buffer.

According to another aspect of the present invention, there is provided a computerized device comprising: at least one microphone to capture user voices; a first processor to digitize and process audio received from the at least one microphone and to store a copy of the processed audio in a circular buffer and to execute a first voice recognition algorithmic model to detect the presence of a predefined wake word in the circular buffer, the first voice recognition algorithmic model selected to provide a predefined relatively low level of false non-matches of the predefined wake word at the cost of a higher than predefined level of false matches of the predefined wake word; a second processor normally operating at a first rate having a first computational capacity and responsive to a signal from the first processor indicating that the first voice recognition algorithmic model has detected the presence of the wake word in the circular buffer such that the second processor commences operation at a second rate having a greater computational capacity than the capacity at the first rate, the second processor receiving a copy of the contents of the circular buffer from the first processor and receiving and buffering a copy of the processed received audio stream in a second buffer, the second processor executing a second voice recognition algorithmic model on the copy of the contents of the circular buffer to verify the presence of the predefined wake word, the second voice recognition algorithmic model requiring greater computational processing than the first voice recognition algorithmic model and being selected to achieve a predefined relatively low level of both false non-matches and false matches of the predefined wake word higher than achieved by the first processor; a data communications module operable to provide data communication between the computerized device and a remote voice processing service, the data communications module providing the voice processing service with the copy of the contents of the circular buffer and the contents of the second buffer when the second voice recognition algorithmic model verifies the presence of the wake word in the copy of the contents of the circular buffer.

The present invention provides a computerized device with voice command capability that is processed remotely. The device includes a low power processor, executing a loose algorithmic model to recognize a wake word prefix in a voice command, the loose model having a low false rejection rate but suffering a high false acceptance rate, and a second processor which can operate in at least a low power/low clock rate mode and a high power/high clock rate mode. When the first processor determines the presence of the wake word, it causes the second processor to switch to the high power/high clock rate mode and to execute a tight algorithmic model to verify the presence of the wake word. By using the two processors in this manner, the average overall power required by the computerized device is reduced, as is the amount of waste heat generated by the system.

BRIEF DESCRIPTION OF THE DRAWINGS

Preferred embodiments of the present invention will now be described, by way of example only, with reference to the attached Figures, wherein:

FIG. 1 shows a perspective view of a computerized device in accordance with an embodiment of the present invention;

FIG. 2 shows a rear view of the computerized device of FIG. 1;

FIG. 3 shows a side view of the computerized device of FIG. 1 with a trim plate shown in phantom;

FIG. 4 shows a block diagram of the hardware of the device of FIG. 1;

FIG. 5 shows a data flow of the processing of received audio at the device of FIG. 1; and

FIG. 6 shows an exploded view of the hardware of the device of FIG. 1, including its speaker system.

DETAILED DESCRIPTION OF THE INVENTION

An example of a computerized device in accordance with the present invention is shown in FIGS. 1 through 4, wherein a computerized HVAC controller is indicated generally at 20. While much of the following description describes an HVAC controller, the present invention is not so limited and it is intended that the system and method can be employed in a wide range of computerized devices such as smartphones, smart watches, garage door opener controllers, internet connected home appliances, handheld computing devices, etc.

Device 20 comprises a housing 24 with a front face 28 that includes at least a portion which is transparent and through which a touchscreen 32 can be viewed and interacted with. Front face 28 can also be equipped with a motion sensor (not shown), which can be used as an occupancy sensor, detecting a user's presence and/or proximity to device 20.

Touchscreen 32 can display a wide variety of information, including operating messages, command response text, icons, controls and menus and can receive inputs from a user to vary operation of device 20 if desired.

Device 20 further includes a pair of spaced microphone apertures 36 which allow sounds from outside housing 24 to reach one or more internal microphones (described below) and a speaker grate 40 (best seen in FIGS. 2 and 3) which allows sounds emitted from an internal speaker (also discussed below) to exit housing 24. Device 20 further includes an activity indicator 44, which can be a light pipe driven by one or more LEDs, a lamp assembly, etc. Housing 24 will typically be mounted to a wall or other surface via a trim plate 46 (shown in phantom in FIG. 3) or direct mounted to the wall (not shown) without a trim plate. Spaced around the device are a plurality of cooling vents 48.

Housing 24 further includes a bevel 50 on its rear face (best seen in FIGS. 2 and 3), which creates an air gap between the device 20 and the trim plate 46 or wall. In particular, bevel 50 includes an increased slope area 52 on its bottom edge around speaker grate 40. Speaker grate 40 is spaced far apart from microphone apertures 36 to minimize interference.

FIG. 4 shows a block diagram of the internal hardware 42 of device 20, and FIG. 6 shows an exploded view of internal hardware 42. Hardware 42 includes an application processor 100, which can be a microprocessor or any other suitable device as will occur to those of skill in the art. Processor 100 is capable of running at different clock rates, to match available program execution rates to computational needs, which can change from time to time. Such multi rate processors are well known. Device 20 further includes memory 104, which can be non-volatile RAM and/or volatile RAM which is accessible by processor 100. As will be apparent to those of skill in the art, memory 104 can be integral with processor 100, or can be separate discrete devices or components, as desired.

Typically, memory 104 will store one or more programs for execution by processor 100, as well as various parameters relating to the execution of the programs and data and working values required by the programs.

Touchscreen 32 is operatively connected to processor 100, as is the motion sensor (if present), and device 20 further preferably includes a real time clock, either as a service provided in processor 100, or as a separate component, not shown.

Device 20 can also include at least one environmental sensor 108, which at a minimum is a temperature sensor but can also include other environmental sensors, such as a humidity sensor, ambient light sensor, magnetic compass, GPS receiver, etc. which determine respective environmental conditions to be controlled and/or monitored. Typically, when device 20 is an HVAC controller, environmental sensors 108 in device 20 will include at least both a temperature sensor and a humidity sensor.

A communication module 112 is connected to processor 100 to allow processor 100 to communicate with communication networks such as the Internet and/or with additional external sensors or computerized devices (not shown). Preferably, communication module 112 is operable to connect to the desired data networks wirelessly, via an antenna 116, using at least one wireless communication protocol, such as WiFi, Bluetooth, ZigBee, ZWave, Cellular Data, etc., but it is also contemplated that communication module 112 can have a wired connection to the data networks, such as via an Ethernet connection.

Communication module 112 also allows device 20 to communicate with Internet based services (such as weather servers, remote monitoring systems, data logging servers, voice processing services, etc.) and with applications used remotely by users of device 20 to monitor and control the controlled premises' environmental state or other conditions. For example, a user remote from device 20 may access an application executing on a smartphone or personal computer to send commands to device 20, via the Internet or other data communications network or system, to alter the operation of device 20 or a system it is controlling.

Device 20 further includes a secondary processor assembly 120, which is capable of digitizing and processing, as described in more detail below, audio signals received from at least one, and preferably two or more, microphones 124. In the present embodiment, secondary processor assembly 120 is a DSP (digital signal processor) which can receive inputs from microphones 124 (which are located within housing 24 adjacent apertures 36), digitize them and perform signal processing operations on those digitized signals in accordance with one or more programs stored within the DSP. While the current embodiment employs a single device DSP with the required capabilities, it is also contemplated that secondary processor assembly 120 can be constructed from two or more discrete components, if desired. It is also contemplated that secondary processor assembly 120 can be a separate computational core, or cores, included in processor 100.

Device 20 further includes a peripheral control block 128, which can be connected to one or more control lines for a system to be controlled by device 20, such as an HVAC system, garage door opener, lighting system, etc. and peripheral control block 128 can receive signals from the connected systems (such as the HVAC system) and/or output control signals thereto in accordance with one or more programs executed by processor 100.

Peripheral control block 128 can include mechanical, or solid state, relays to provide outputs to control lines, as well as a MUX or other suitable devices for receiving relevant input signals from the HVAC or other controlled system and providing those signals to processor 100.

The hardware 42 on device 20 further includes an audio output subsystem 132, which is operable in response to signals received from processor 100, to output an amplified audio signal to a speaker system 136. Audio output subsystem 132 can be a discrete device, or combination of suitable discrete devices, as desired and is preferably capable of outputting voice signals and/or music or other sounds. Best seen in FIG. 6, speaker system 136 includes a speaker driver 140 and a speaker cone 142 (which are in communication with speaker grate 40, shown in FIGS. 2 and 3). By virtue of being mounted along sloped area 52, the speaker system 136 outputs its sound at an angle that is non perpendicular to the back surface (i.e., trim plate 46 or the mounting wall), reducing the amount of sound echoing back into the speaker system 136. When trim plate 46 is in use, a sloped surface 148 on the trim plate 46 helps guide the outputted sound out through a gap 150 between trim plate 46 and housing 24, directing the outputted sound away from both speaker system 136 and microphone apertures 36. The absence or presence of trim plate 46 will affect the performance of speaker system 136 and the quality and volume of its outputted sound, as would the material characteristics of the mounting wall (drywall, brick, plaster, cinderblock, etc.). As such, it is contemplated that audio output subsystem 132 would include different sound output profiles based upon the device 20's mounting surface. Each sound output profile would shape the frequency and amplitude output of speaker system 136 for optimal performance for its particular mounting surface. Selection of the appropriate sound output profile could be performed by the user via touchscreen 32 or other appropriate means. Alternatively, device 20 could preselect a sound output profile (i.e., “Trim plate present sound profile”/“Trim plate not present sound profile”) automatically based upon the automated detection of a trim plate 46. Automatic detection could be provided through the use of switches on the back of device 20 (not shown) that detect contact with a trim plate 46, or other appropriate means as will occur to those of skill in the art.

User inputs to device 20 can be achieved via internet connected applications running on smartphones or the like, touchscreen 32 and/or responses from cloud-based processing of voice commands received from the remote processing service by device 20.

When device 20 also serves as a voice command input device for such commands, a user's spoken voice commands are received by microphones 124 and, as is described in more detail below, a representation of that received audio is transmitted by device 20 over the internet or other data network to the remote processing service. The remote processing service receives the transmitted representation of the audio and determines the meaning of the spoken voice commands and prepares an appropriate response which is then returned to device 20 for execution, or otherwise processed by another device or service.

Depending upon the range of services offered by the remote voice processing service, the response to a spoken voice command can be selected from a wide range of responses. For example, the remote processing service may have a limited set of available responses, all directly related to the control and operation of device 20, i.e., the voice command could have been a request to raise the temperature of the environment controlled by device 20, when serving as an HVAC controller, by one or more degrees and the response returned by the remote voice processing service in such a case would be the necessary program commands for device 20 to raise its target temperature by the one or more degrees the user commanded, along with an audio stream of a voice confirmation.

In a more preferred embodiment, the remote voice processing service is a broadly capable system, such as the above-mentioned ALEXA Voice Service, and the voice commands which can be processed range far beyond those specifically related to the control and operation of device 20. For example, a user can ask for the current time and the remote voice processing service will return an audio stream of a voice saying the current time to device 20, along with the program commands necessary to have that audio stream played to the user through speaker 136.

Similarly, the user may order fast food, such as a pizza, by voice command to device 20 and the remote voice processing service will complete the order, perhaps through an interactive set of audio exchanges with the user through microphones 124 and speaker 136 or in accordance with predefined settings (size of pizza, toppings, payment method, etc.) previously defined by the user, and will forward the resulting order through the internet to the pizza supplier while confirming the same to the user via an appropriate audio voice stream output at device 20.

In this regard, computerized device 20 can perform many or all of the functions of a voice command input device such as the Amazon Echo device, typically used to interact with the ALEXA voice service, or the corresponding Google Home device and service, etc. in addition to performing its other control functions, such as regulating temperature and/or humidity in an environment.

However, unlike the above-mentioned Echo and/or Home devices, computerized devices such as device 20 face some specific challenges in providing voice command services via a remote voice processing system. In particular, as mentioned above, known implementations of such voice command input devices require a user to preface any spoken command to the system with a wake word or phrase, for example “ALEXA” for the Echo or “Okay Google” for the Google Home. Thus, an appropriate command to the Echo might be, for example, “Alexa, please order me a pizza” which would start the above-mentioned interactive process to complete the order. In contrast, “Order a pizza” would not invoke any response as the wake word was not present.

For an acceptable user experience, a voice command input device should have a very low rate of False Acceptances (“FA”), defined as the case where the voice command device incorrectly determines that the wake word has been received, and a very low rate of False Rejections (“FR”), defined as the case where the voice command device misses (does not recognize) that the wake word has been spoken to it.
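The two error rates defined above can be illustrated with a short calculation. The following Python sketch is illustrative only: the counts are invented, and normalizing FA's by the number of audio segments without a wake word (rather than, say, per hour of audio) is an assumption made here, not a definition taken from this description.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Outcome counts from a hypothetical labeled test set (invented values)."""
    true_accepts: int   # wake word spoken and detected
    false_rejects: int  # wake word spoken but missed (FR)
    false_accepts: int  # wake word not spoken but detected (FA)
    true_rejects: int   # wake word not spoken and not detected


def false_rejection_rate(c: DetectionCounts) -> float:
    """Fraction of spoken wake words that the detector missed."""
    spoken = c.true_accepts + c.false_rejects
    return c.false_rejects / spoken if spoken else 0.0


def false_acceptance_rate(c: DetectionCounts) -> float:
    """Fraction of wake-word-free segments that the detector accepted."""
    not_spoken = c.false_accepts + c.true_rejects
    return c.false_accepts / not_spoken if not_spoken else 0.0


counts = DetectionCounts(true_accepts=98, false_rejects=2,
                         false_accepts=30, true_rejects=870)
print(f"FR rate: {false_rejection_rate(counts):.3f}")
print(f"FA rate: {false_acceptance_rate(counts):.3f}")
```

With these invented counts the detector misses 2 of 100 spoken wake words and wrongly fires on 30 of 900 segments that contain no wake word.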

To ensure acceptably low rates of FA's and FR's are obtained, existing voice command input devices employ sophisticated, and computationally expensive (often referred to as “tight”), algorithmic models which process captured audio streams to determine, with high probability, the presence of the wake word. The actual tight model employed is not particularly limited and new models are being developed all the time.

The voice command input device continually listens to its surrounding environment, processing the captured audio stream with the tight algorithmic model and, when the presence of the wake word is determined, the voice command input device forwards an audio stream containing the wake word and the subsequently captured audio to the remote voice processing service, as described above.

While such systems have proven to be very successful in providing an acceptable user experience, problems exist in attempting to implement such systems on computerized devices such as device 20. Specifically, the above-mentioned tight algorithmic models used to detect the presence of the wake word require a computationally powerful processor running at a relatively high clock rate to execute the necessary computations, and a processor operating under such conditions draws relatively high amounts of power and generates correspondingly large amounts of waste heat. Depending upon the actual construction and use of computerized device 20, the computation requirements, power requirements and waste heat produced can be significant problems.

For example, if computerized device 20 is an HVAC controller with one or more onboard environmental sensors 108, as described above, the waste heat generated by a processor executing a tight wake word recognition model will affect the readings provided from environmental sensors 108 and will have to be compensated for to obtain accurate temperature readings. A similar heating problem can be experienced for handheld or wearable computerized devices where the waste heat generated may make it uncomfortable for a user to hold or wear the computerized device.

Similarly, if computerized device 20 is battery powered, or is powered from an otherwise limited supply, such as the low current 16 VAC control lines typically employed in HVAC systems, continued high clock rate operation of the processor executing the tight wake word recognition model may overload the power supply or too quickly deplete the battery.

Also, a processor which is capable of continually executing a tight wake word recognition model, in addition to performing other tasks (such as controlling an HVAC or other system or devices), will have to be a more computationally powerful, and hence more expensive, processor than might otherwise be required.

Thus, computerized device 20 employs a two stage approach to wake word recognition which can reduce the average computational processing and electrical power requirements for the ongoing monitoring for the receipt of a wake word and which can correspondingly reduce waste heat generated within device 20 and which can allow for the use of a less computationally powerful processor.

Specifically, in device 20 secondary processor assembly 120 is a low capability (i.e., computationally) processor, relative to processor 100, and has correspondingly lower electrical power consumption and waste heat generation. Audio signals received via microphones 124 are processed by secondary processor assembly 120 to perform far field audio processing, such as echo cancellation, noise reduction, direction of arrival, gain control, etc. as well as executing a low computational complexity “loose” wake word recognition model.

In contrast to tight models, this loose model is selected and configured for execution by a low powered processor, such as secondary processor assembly 120, with the knowledge that the expected accuracy of the loose model will not meet the overall requirements for acceptably low rates of FA's and FR's. Specifically, the selection criteria for the loose model are not particularly limited as long as the loose model can be executed reasonably quickly (able to process received audio streams at least in real time) and the selected model is configured to provide an FR rate which is acceptably low, albeit at the cost of having a high rate of FA's.
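One way to picture this configuration choice is as a decision threshold on a detector score, set low enough to keep FR's rare while accepting whatever FA rate results. The following Python sketch assumes a score-based detector, which this description does not specify; the function names, the scores and the 10% FR target are all hypothetical.

```python
def pick_loose_threshold(positive_scores, max_fr):
    """Return the highest threshold whose FR rate stays at or below max_fr.

    positive_scores: detector scores for clips known to contain the wake word.
    A clip is accepted when its score is >= the threshold.
    """
    ranked = sorted(positive_scores)
    allowed_misses = int(max_fr * len(ranked))
    return ranked[allowed_misses]


def error_rates(threshold, positive_scores, negative_scores):
    """Return (FR rate, FA rate) for the given threshold."""
    fr = sum(s < threshold for s in positive_scores) / len(positive_scores)
    fa = sum(s >= threshold for s in negative_scores) / len(negative_scores)
    return fr, fa


# Invented scores: clips with the wake word, and clips without it.
with_wake_word = [0.35, 0.55, 0.60, 0.70, 0.80, 0.85, 0.90, 0.90, 0.95, 0.99]
without_wake_word = [0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.50, 0.60, 0.65, 0.75]

threshold = pick_loose_threshold(with_wake_word, max_fr=0.1)
fr, fa = error_rates(threshold, with_wake_word, without_wake_word)
print(threshold, fr, fa)
```

In this invented data set the threshold that holds the FR rate to 10% lets through 30% of the clips that contain no wake word, which is the trade the loose model is configured to make; the second stage is then relied upon to reject those false acceptances.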

FIG. 5 shows a data flow drawing of the processing of received audio at device 20. As shown, one or more streams of audio 200, from microphones 124, are continually received at secondary processor assembly 120.

On an ongoing basis, secondary processor assembly 120 digitizes the received streams 200 and performs any other desired processing of the streams (i.e., combining two or more received streams into a single stream, echo cancellation, beam forming, gain, etc.) to form a cleaned stream 204.

A copy of cleaned stream 204 is forwarded by secondary processor assembly 120 to a “look behind” circular buffer 208, implemented in processor 100, which stores a preselected length of the most recently received cleaned stream 204. In a present implementation, circular buffer 208 stores about two seconds of the latest cleaned stream 204.
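A look behind buffer of this kind can be sketched in a few lines. The Python example below is a minimal illustration, not the implementation used in device 20: the class name and the 16 kHz sample rate are assumptions, and only the two second length comes from the description above.

```python
from collections import deque


class LookBehindBuffer:
    """Fixed-length buffer that keeps only the most recent audio samples."""

    def __init__(self, seconds=2.0, sample_rate_hz=16_000):
        # sample_rate_hz is an assumed value; the source gives only the duration.
        self._samples = deque(maxlen=int(seconds * sample_rate_hz))

    def write(self, chunk):
        """Append new samples; the oldest samples fall out once the buffer is full."""
        self._samples.extend(chunk)

    def snapshot(self):
        """Return a copy of the contents that later writes cannot overwrite."""
        return list(self._samples)


# Tiny demonstration: 2 seconds at 4 samples per second holds 8 samples.
demo = LookBehindBuffer(seconds=2.0, sample_rate_hz=4)
demo.write(range(10))
print(demo.snapshot())  # [2, 3, 4, 5, 6, 7, 8, 9]
```

The snapshot method returns a copy of the contents rather than a view, so that audio written afterwards cannot overwrite what was captured.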

At this time, processor 100 is running at a reduced clock rate, selected to be sufficient to service the processing load on processor 100 to implement and update circular buffer 208 and to perform any other computational tasks required by the programs processor 100 is executing, and processor 100 is not executing the tight model for wake word recognition.

At this reduced clock rate, both the electrical power requirements and waste heat generated by processor 100 are reduced from the corresponding levels which would be experienced when processor 100 is running at a higher clock rate.

Secondary processor assembly 120 also processes cleaned stream 204 with the loose wake word recognition model it is executing. As mentioned above, this loose model is selected and configured to provide an acceptably low rate of FR's at the cost of a relatively high rate of FA's, such that the probability of device 20 missing a user stating a wake word is acceptably low.

Each time the loose wake word recognition model executed by secondary processor assembly 120 determines (correctly or incorrectly) that cleaned stream 204 contains the predefined wake word, secondary processor assembly 120 provides a signal 212 to processor 100. In a present implementation, signal 212 is an interrupt signal, but any other suitable means of signaling processor 100 can be employed as desired.

Upon receipt of signal 212, processor 100 switches itself to operate at a suitably higher clock rate and makes a copy of the contents of circular buffer 208 to prevent them from being overwritten by updates from cleaned stream 204. Processor 100 further creates, or reuses, a linear buffer 210 in which it stores the ongoing received cleaned stream 204 from secondary processor assembly 120 (representing received user speech which occurred after the copied contents of circular buffer 208). Thus, between the copy of the contents of circular buffer 208 and the contents stored in linear buffer 210, processor 100 has a processed audio stream containing the suspected wake word and any subsequent commands spoken by a user.

After copying the contents of circular buffer 208 and setting up and starting to load linear buffer 210, processor 100 commences analysis of the copied contents of circular buffer 208 with the tight wake word recognition model it implements. The tight wake word recognition model processes the copy of the contents of circular buffer 208 to determine, with acceptably low rates of both FR's and FA's, whether or not those contents contain the predefined wake word.

As will be apparent, this processing by the tight model essentially verifies, with a higher level of confidence, the determination made by the loose model used in secondary processor assembly 120. As previously stated, the loose model is selected to provide a low level of missed (FR) wake word recognitions at the cost of a relatively high level of false acceptances (FA).

By using this two stage approach, where processor 100 verifies the presence, or absence, of the wake word with a tight model after secondary processor assembly 120 has used a loose model to determine that a wake word has been received, a high probability of correctly detecting the presence of the wake word is obtained. A reduced average overall level of computation is required, as the tight model is not executed on a continuous basis as was the case with prior art voice command input devices.

Thus, on average, the electrical power requirements for processor 100 are reduced, as is the waste heat generated by processor 100. Further, by employing good real time programming practices in interrupting and/or suspending the execution of other programmed tasks on processor 100 when signal 212 is received, the required computational capabilities of processor 100 can be reduced, from an amount required for continuously processing the tight model in addition to the other programs executing on processor 100 to an amount required for the as-needed processing of the tight model while the other programs executed by processor 100 are suspended, or processed at a lower rate by processor 100. This allows a lower cost device to be employed as processor 100 than would be the case if the tight model were continuously executing on processor 100 in addition to the other programs executing on processor 100.
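The averaging argument above can be made concrete with a short worked calculation. The power figures and the fraction of time spent at the higher clock rate are hypothetical values chosen only to show the arithmetic; they are not measurements of any device, and the (small) power drawn by secondary processor assembly 120 is left out for simplicity.

```python
def average_power_mw(p_low_mw, p_high_mw, high_fraction):
    """Time-weighted average power of a processor with two clock modes.

    high_fraction is the fraction of time spent in the high clock
    rate mode (0.0 to 1.0); the remainder is spent in the low mode.
    """
    if not 0.0 <= high_fraction <= 1.0:
        raise ValueError("high_fraction must be between 0 and 1")
    return p_high_mw * high_fraction + p_low_mw * (1.0 - high_fraction)


# Hypothetical figures: 200 mW at the low rate, 1000 mW at the high rate.
always_high = average_power_mw(200.0, 1000.0, high_fraction=1.0)
two_stage = average_power_mw(200.0, 1000.0, high_fraction=0.05)
```

Under these assumed figures, a processor that runs the tight model continuously draws 1000 mW on average, while one that is raised to the higher clock rate only 5% of the time draws about 240 mW on average.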

If the tight model executed by processor 100 determines that the copy of the contents of circular buffer 208 did not contain the wake word (i.e., the loose model executed by secondary processor assembly 120 made a False Acceptance), the copy of the contents of circular buffer 208 is deleted, as is linear buffer 210, and processor 100 returns to a lower clock rate operating mode. Alternatively, linear buffer 210 may be retained to be overwritten the next time processor 100 detects the wake word, or even transmitted to a remote site for analysis and learning of FA conditions.

If the tight model executed by processor 100 determines that the copy of the contents of circular buffer 208 does contain the wake word (i.e., the determination by the loose model executed by secondary processor assembly 120 was correct) then processor 100: illuminates activity indicator 44 to provide a visual indication to the user that it has received the wake word and is listening for further commands; connects to a predefined remote voice processing service 216 (such as Amazon Voice Service) via communication module 112; and transfers the copied contents of circular buffer 208 to voice processing service 216, transfers the current contents of linear buffer 210 (preferably at a rate higher than the rate at which new data is being added to linear buffer 210) and continues to transfer new data added to linear buffer 210 to voice processing service 216.
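The handling of signal 212 by processor 100, covering both the False Acceptance case and the verified case described above, can be sketched as follows. This is an illustrative sketch only: the class, its attributes and the two callables are hypothetical stand-ins, plain Python lists take the place of the audio buffers, and the concurrent filling of linear buffer 210 is not modeled.

```python
from enum import Enum


class ClockMode(Enum):
    LOW = "low"
    HIGH = "high"


class ApplicationProcessor:
    """Hypothetical stand-in for processor 100."""

    def __init__(self, tight_model, send_to_service):
        self.tight_model = tight_model          # callable: samples -> bool
        self.send_to_service = send_to_service  # callable: samples -> None
        self.mode = ClockMode.LOW
        self.circular = []      # stand-in for circular buffer 208
        self.linear = None      # stand-in for linear buffer 210
        self.indicator_on = False

    def on_signal_212(self):
        """Handle a suspected wake word reported by the loose model."""
        self.mode = ClockMode.HIGH
        snapshot = list(self.circular)  # copy so updates cannot overwrite it
        self.linear = []                # create (or reuse) the linear buffer
        if not self.tight_model(snapshot):
            # False Acceptance: discard the buffers and drop the clock rate.
            self.linear = None
            self.mode = ClockMode.LOW
            return False
        # Verified: light the indicator, then forward both buffers in order.
        self.indicator_on = True
        self.send_to_service(snapshot)
        self.send_to_service(list(self.linear))
        return True
```

In this sketch the processor stays at the higher clock rate after a verified wake word, since it must continue forwarding audio until the voice command session ends.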

By transferring the copied contents of circular buffer 208 and the contents of linear buffer 210, including any newly added content, voice processing service 216 will receive a continuous portion of cleaned stream 204 including the alleged wake word and any subsequent user voice commands.
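The preference for transferring buffered audio faster than new audio arrives can be illustrated with a simple calculation. The function below is a sketch under the assumption that new audio accumulates at exactly real time and that the upload proceeds at a constant multiple of real time; the function name and the example figures are hypothetical.

```python
def catch_up_seconds(backlog_seconds, upload_speedup):
    """Time needed for the upload to catch up with live audio.

    backlog_seconds: audio already buffered when the transfer starts
    upload_speedup:  upload rate as a multiple of real time (must be > 1)
    """
    if upload_speedup <= 1.0:
        raise ValueError("upload must be faster than real time to catch up")
    # While uploading, the backlog shrinks at (speedup - 1) seconds of
    # audio per second, because new audio keeps arriving at 1x.
    return backlog_seconds / (upload_speedup - 1.0)


# Example: 3 s of buffered audio, uploaded at 4x real time.
delay = catch_up_seconds(3.0, 4.0)
```

With these example figures the backlog is cleared after one second, after which the service receives the user's speech essentially as it is spoken.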

In a presently preferred embodiment, upon receiving the relevant portion of cleaned stream 204, voice processing service 216 executes an even tighter, and more computationally expensive, algorithmic model to confirm the presence of the wake word as a final verification. In this embodiment, if the tighter model executed by voice processing service 216 determines that the wake word is not in fact present in the data it receives from processor 100, voice processing service 216 sends a signal, representing a “False Accept” condition, to processor 100 via communication module 112, and processor 100 then extinguishes activity indicator 44, deletes the copy of circular buffer 208 and linear buffer 210 and returns to a lower clock rate operating mode.

If the tighter model executed by voice processing service 216 confirms the presence of the wake word, or if this verification step is not desired to be performed, voice processing service 216 continues to process the remainder of the received cleaned stream 204 to determine the user's spoken voice commands.

Once the received spoken commands are determined and processed, voice processing service 216 creates an appropriate response, or responses, for the determined commands and transmits those responses to the devices and/or services appropriate for those responses. For example, if the user voice commands were for computerized device 20 to alter one of the parameters or systems it is controlling, then a response is sent from voice processing service 216 to processor 100, via communication module 112, altering the operation of processor 100, or causing processor 100 to output a desired control signal output 220 to a connected module, component or device such as peripheral control block 128. Any response sent to device 20 will also typically include a suitable voice response confirmation played to the user via audio output subsystem 132 and speaker system 136.

As an example, if the user command was, “Alexa, please raise the temperature by two degrees”, voice processing service 216 can send a response to processor 100 including commands to increase the target temperature value stored in its memory by two degrees and to cause processor 100 to announce to the user, “I've raised the temperature by two degrees, as you requested” via audio output subsystem 132 and speaker system 136.

If the user command was, “Alexa, please turn off the fan”, voice processing service 216 can send a response to processor 100 including commands to produce control signal output 220 to peripheral control block 128 (or the like) to deactivate the HVAC circulating fan and to cause processor 100 to announce to the user, “I've turned the fan off, as you requested” via audio output subsystem 132 and speaker system 136.

If the user command was, “Alexa, I want to order a pizza”, voice processing service 216 can send a response to processor 100 with commands to initiate an appropriate interactive voice process to determine the relevant particulars for the pizza order and to place the order with an appropriate pizza supplier and to cause processor 100 to announce to the user, “OK, I've ordered your pizza, as you requested” via audio output subsystem 132 and speaker system 136.
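How processor 100 might act on such responses can be sketched as below. The response layout, the command type names and the device state dictionary are all hypothetical and are introduced purely for illustration; the description above does not define a message format, and no actual voice service interface is represented.

```python
def apply_response(state, response):
    """Apply a hypothetical service response to a device state dict.

    Returns the list of phrases to be announced to the user.
    """
    announcements = []
    for command in response["commands"]:
        if command["type"] == "adjust_target_temperature":
            # e.g. "raise the temperature by two degrees"
            state["target_temperature"] += command["delta_degrees"]
        elif command["type"] == "set_fan":
            # Stands in for producing a control signal output.
            state["fan_on"] = command["on"]
        elif command["type"] == "announce":
            announcements.append(command["text"])
    return announcements


# Illustrative state and response only.
state = {"target_temperature": 20, "fan_on": True}
response = {
    "commands": [
        {"type": "adjust_target_temperature", "delta_degrees": 2},
        {"type": "announce",
         "text": "I've raised the temperature by two degrees, as you requested"},
    ]
}
spoken = apply_response(state, response)
```

In this sketch each response carries both the change to be made and the confirmation to be spoken, mirroring the examples above in which every command is acknowledged to the user.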

Once any voice command session is completed, voice processing service 216 provides a session complete response to processor 100, which then deletes linear buffer 210 and the copy of the contents of circular buffer 208, and processor 100 returns to its low power/reduced clock rate operating mode.

While the above-described example of a computerized device in accordance with the present invention is shown as including a set of components appropriate for the HVAC controller of the example, it should be realized that the present invention is not so limited. For example, if computerized device 20 is a garage door opener, touchscreen 32 and environmental sensors 108 can be omitted. Further, the above-described example does not explicitly illustrate the electrical power supply for computerized device 20. In the illustrated example of an HVAC controller, it is contemplated that typically computerized device 20 will be powered parasitically, from power on the HVAC system's control lines. However, it is also contemplated that computerized device 20 can be powered from batteries, a separate power supply, etc.

As should now be apparent from the description above, the present invention provides a system and method for recognizing a wake word prefix for a spoken command which is to be processed by a remote voice processing system. The system and method employ a low power processor, with relatively limited processing capability, to execute a loose recognition model which provides a low FR rate, but which is correspondingly subject to an otherwise unacceptably high FA rate. When this loose recognition model determines that the wake word is present in a buffered and stored audio stream, it signals a second, application, processor which is capable of operating at no fewer than two clock rates, a low clock rate and a high clock rate, where the high clock rate results in the application processor being more computationally capable than when operating at the lower clock rate and being more computationally capable than the low power processor, but which also increases the electrical power requirements of the application processor and the amount of waste heat it generates.

Upon receiving the signal, the application processor switches its operation to the higher clock rate, executes a tight recognition model and analyzes the contents of the buffered and stored audio stream to verify the presence of the wake word. If the wake word is not present, the application processor returns to operating at the lower clock rate and deletes the buffered and stored audio stream.

If the application processor confirms the presence of the wake word in the buffered and stored audio stream, the buffered and stored audio stream, and the subsequently captured audio stream of user commands, are forwarded to the voice processing service for further processing.

This system and method of using a low power processor, and a second, more capable application processor which can operate in at least a low power/low clock rate mode and a high power/high clock rate mode, reduces the average overall electrical power required by the system and method, as well as correspondingly reducing the waste heat generated by the system, and can reduce the level of computational processing capability required by the application processor, thus reducing its cost.

The above-described embodiments of the invention are intended to be examples of the present invention and alterations and modifications may be effected thereto, by those of skill in the art, without departing from the scope of the invention which is defined solely by the claims appended hereto.

I claim:
 1. A method of recognizing a wake word for voice commands to a computerized device, comprising the steps of: (i) receiving at the computerized device an audio signal from at least one microphone; (ii) processing the received audio signal with a first processor in the computerized device, the first processor placing a copy of the processed received audio signal into a circular buffer of a preselected size and the first processor executing a first voice recognition algorithmic model to detect the presence of a predefined wake word, the first voice recognition algorithmic model selected to provide a predefined relative low level of false non-matches of the predefined wake word at the cost of a higher than predefined level of false matches of the predefined wake word; (iii) upon the first processor determining a match of the predefined wake word, the first processor providing a second signal to a second processor in the computerized device, the second processor normally operating at a first rate having a first computational capacity, the second signal causing the second processor to commence operating at a second rate having a second computational capacity greater than the first computational capacity, and the second processor: (a) copying the contents of the circular buffer into a second buffer; (b) copying the processed received audio signal into a third buffer; (c) executing a second voice recognition algorithmic model on the contents of the second buffer to verify the presence of the predefined wake word, the second voice recognition algorithmic model requiring greater computational processing than the first voice recognition algorithmic model to achieve a predefined relatively low level of both false non-matches and false matches of the predefined wake word; and (d) upon completion of analyzing the contents of the second buffer with the second voice recognition algorithmic model, if the second voice recognition algorithmic model determines that the predefined wake word is not present in the second buffer, returning the second processor to operate at the first rate and, if the second voice recognition algorithmic model determines that the predefined wake word is present in the second buffer, then forwarding the contents of the second buffer and the third buffer to a voice processing service located remote from the computerized device, the voice processing service operable to receive and process voice commands.
 2. The method of claim 1 further comprising the step of the second processor activating an activity indicator when the second processor determines, at (c) that the predefined wake word is present in the second buffer.
 3. The method of claim 1 where the voice processing service executes a third voice recognition algorithmic model requiring greater computational processing than the second voice recognition algorithmic model, the voice processing service executing the third voice recognition algorithmic model on the copy of the second buffer received at the voice processing service to verify the presence of the wake word therein and, if the third voice recognition algorithmic model does not verify the presence of the wake word, then the voice processing service sending a message to the computerized device indicating that the wake word was not present and the second processor returning to operating at the first rate and, if the third voice recognition algorithmic model does verify the presence of the wake word, then the voice processing service processing the contents of the third buffer.
 4. A computerized device comprising: at least one microphone to capture user voices; a first processor to digitize and process audio received from the at least one microphone and to store a copy of the processed audio in a circular buffer and to execute a first voice recognition algorithmic model to detect the presence of a predefined wake word in the circular buffer, the first voice recognition algorithmic model selected to provide a predefined relative low level of false non-matches of the predefined wake word at the cost of a higher than predefined level of false matches of the predefined wake word; a second processor normally operating at a first rate having a first computational capacity and responsive to a signal from the first processor indicating that the first voice recognition algorithmic model has detected the presence of the wake word in the circular buffer such that the second processor commences operation at a second rate having a greater computational capacity than the capacity at the first rate, the second processor receiving a copy of the contents of the circular buffer from the first processor and receiving and buffering a copy of the processed received audio stream in a second buffer, the second processor executing a second voice recognition algorithmic model on the copy of the contents of the circular buffer to verify the presence of the predefined wake word, the second voice recognition algorithmic model requiring greater computational processing than the first voice recognition algorithmic model and being selected to achieve a predefined relatively lower level of both false non-matches and false matches of the predefined wake word than achieved by the first processor; and a data communications module operable to provide data communication between the computerized device and a remote voice processing service, the data communications providing the voice processing service with the copy of the contents of the circular buffer and the contents of second buffer to the voice processing service when the second voice recognition algorithmic model verifies the presence of the wake word in the copy of the contents of the circular buffer.